Typicality- and instance-dependent label noise-combating: a novel framework for simulating and combating real-world noisy labels for endoscopic polyp classification

Learning with noisy labels aims to train neural networks with noisy labels. Current models handle instance-independent label noise (IIN) well; however, they fall short with real-world noise. In medical image classification, atypical samples frequently receive incorrect labels, rendering instance-dependent label noise (IDN) an accurate representation of real-world scenarios. However, the current IDN approaches fail to consider the typicality of samples, which hampers their ability to address real-world label noise effectively. To alleviate the issues, we introduce typicality- and instance-dependent label noise (TIDN) to simulate real-world noise and establish a TIDN-combating framework to combat label noise. Specifically, we use the sample’s distance to decision boundaries in the feature space to represent typicality. The TIDN is then generated according to typicality. We establish a TIDN-attention module to combat label noise and learn the transition matrix from latent ground truth to the observed noisy labels. A recursive algorithm that enables the network to make correct predictions with corrections from the learned transition matrix is proposed. Our experiments demonstrate that the TIDN simulates real-world noise more closely than the existing IIN and IDN. Furthermore, the TIDN-combating framework demonstrates superior classification performance when training with simulated TIDN and actual real-world noise.


Introduction
Deep learning neural networks have achieved remarkable performance [1] due to large amounts of labeled data availability.Unfortunately, labeling for medical image classification is often time-consuming and expertdemanding, which could lead to incorrect annotations.
Noise labels refer to incorrect annotations, which can originate from inexperienced experts or mistakes made by annotators [2], particularly in endoscopic polyp classification with indistinct features.Noisy labels can mislead deep neural networks due to their strong ability to fit images and labels [3].Consequently, learning with noisy labels (LNL) methods have been developed.These techniques aim to train neural networks effectively using noisy labels while achieving high accuracy (ACC) on wellannotated test sets.Previous studies [4][5][6][7][8] developed models that handle simulated instance-independent label noise (IIN) [9].However, their effectiveness is limited in dealing with real-world label noise [10].Under the IIN paradigm, human-generated noisy labels Y is only related to the original true labels Y , i.e., the noisy transition probability is P( Y |Y ) .However, in actual scenarios, label noise is often related to the samples; for example, atypical samples are more likely to be mislabeled.This leads to the concept of instance-dependent label noise (IDN), where the transition probability becomes P( Y |Y , X) , where X denotes the input images.The IDN models the real-world scenario better, resulting in improved handling of realworld label noise compared with the IIN.Therefore, to address the challenge of learning with real-world label noise, it is crucial to simulate and combat it.
Methods for simulating label noise can be divided into IIN and IDN.The simulated IIN flips the original labels using a noise transition probability matrix [11][12][13].This process depends only on the class of the original label.Classic IIN includes random flipping and pair flipping noise.In the IDN paradigm, the simulated label noise described in ref. [14] converts the pixel value into the probability of flipping labels.This approach combines instances and the probability of flipping; however, it lacks reasonableness and ignores the typicality of the samples.Cheng et al. [15] presented a boundary noise model confined to twodimensional feature spaces.This approach is overly simplistic for complex, multidimensional spaces and falls short of accurately representing real-world label noise.The current IDN fails to consider the critical factors of typicality, particularly in medical tasks.In practical scenarios, the mislabeling of data often correlates with the typicality of the instance features.Figure 1 demonstrates how beginners might find it challenging to correctly identify small atypical lesions, as shown in the second column.Similarly, the experts and novices may have misclassified a blurred adenoma polyp in the third column.
Methods for combating label noise can be categorized into model-based or model-free approaches based on whether they model the noisy transition distribution from the ground truth to noisy labels.Model-free approaches do not model the noise paradigm (i.e., IIN or IDN).They mainly rely on the "small loss trick" [16], which suggests that the training loss for samples with noisy labels tends to be larger than for those with ground truth.This category includes methods such as MentorNet [17], co-teaching [4], and co-teaching+ [16].Sample selection methodologies for identifying labels likely to be valid for network training have emerged.Double branch networks [4,10,16] have enhanced the selection precision.However, the "small loss trick" is ineffective for the IDN paradigm [14], as neural networks may overfit complex decision boundaries.Semisupervised learning methods [5,18,19] have also been Fig. 1 Effect of typicality in real-world label noise.The ground truth above the images is derived from histopathology.The left colonoscopy image displays typical characteristics, whereas the middle one shows atypicality with a small lesion, and the right one is blurred.A human expert provides the first row of real-world noisy annotations, and the second row represents the opinion of a human beginner.Incorrect labels are marked in red color adapted for the LNL problem.These methods leverage the information within the images of noisy samples to assist in selecting and correcting noisy labels.However, these methods do not fully utilize the information in noisy labels, and the correction error for noisy labels remains uncontrolled.
In comparison, model-based methods are deemed more reliable because they theoretically guarantee an optimal classifier for modeling the distribution of true labels.These methods introduce a noisy transition matrix T (X) , where X denotes the raw instances.This matrix represents the transition probability from the latent ground truth to the observed noisy labels.Given oracle T * , a statistically consistent model can be learned by minimizing the cross-entropy loss reweighted by T * [20].However, the existing model- based studies rely on strong assumptions.Under the IIN assumption, which implies T (X) = T c×c , ref. [6] established a Softmax layer representing the IIN transition channel, which is optimized in an expectationmaximizing manner.Anchor points methods [21], which assume that the most confident samples of neural networks are predicted correctly as anchor points, estimate and fill the simple T c×c .Unfortunately, the estimated T c×c of IIN cannot improve real-world noisy labels.Under the complex IDN assumption, Xia et al. [14] assumed that the noisy transition matrix depends only on the parts [22] of the instances rather than the raw images.Part-dependent methods are ineffective for medical images with more complex features and are difficult to compose into parts.Cheng et al. [15] introduced a method designed to be robust to binary boundary noise and validated it in a two-dimensional feature space, which is inapplicable to complex medical image classification tasks.CSIDN [23] estimated T (X) according to the confidence of each sample but did not consider overconfidence from neural networks.In addition to the strong assumptions model-based IDN methods mentioned above, these methods overlook the relevance between typicality and the noisy transition matrix, which aligns with the wild.
We introduce typicality-and IDN (TIDN) to simulate real-world label noise and develop a TIDN-combating framework to combat the label noise.A TIDN is generated by disturbing the original labels according to the typicality of the samples.We propose using the distance between the samples and decision boundaries to represent typicality, calculated using a support vector machine (SVM) [24].In the TIDN-combating framework, we establish a TIDN-attention module to link features and noisy transition matrix.A recursive algorithm was proposed to enable the framework to learn the noisy transition matrix, following the spirit of the expectation-maximization (EM) algorithm.The classification network correctly predicts with corrections from the learned noisy transition matrix.Moreover, we proposed using an instance-independent noisy transition matrix to initialize the instance-dependent matrix in a recursive algorithm.
Our main contributions are as follows: • We introduce a TIDN to simulate real-world label noise closely.In the TIDN paradigm, atypical samples are more likely to be mislabeled.We propose using the distance between the samples and decision boundaries to represent typicality, calculated using an SVM.The remainder of this paper is organized as follows.Methods and experimental setups are described in detail in the Methods section.The experimental results are reported in Results section to demonstrate the effectiveness of the proposed method.In the Discussion section, we provide an extended discussion.

Methods
The workflow of the proposed methods is depicted in Fig. 2. To address the problem of combating realworld label noise, we first seek a simulated label noise to approximate the real world.After that, we design a TIDN-combating framework to combat the wellsimulated label noise.With the success in combating well-simulated noise, this framework can also address real-world label noise., where x n represents the input medical images and y n , y n ∈ {1, . . ., C} are the corresponding real-world noisy labels and ground truth, respectively.

Preliminaries
The simulation objective is to generate instancedependent noisy labels y .

Simulating the TIDN
Given a dataset x n , y n , y n N n=1 , we generated a simulated y′ n that could be in close proximity to the real-world noise y n under the IDN paradigm.In actual medical labeling scenarios, instances with typical characteristics are less likely to be mislabeled than those with atypical characteristics.Based on this observation, we propose a method that converts the per-sample distance from the classification boundary into the probability of label disturbance.Figure 3 presents a simplified illustration of the proposed TIDN model.This highlights that samples located at the classification boundaries are susceptible to mislabeling.However, it is important to note that the feature space often has a higher dimensionality in image classification tasks.
An SVM was used to calculate the classification boundary within the feature space explicitly.The boundary hyperplanes, as defined by the "one versus rest" SVM approach [24], are denoted as H i , where i ∈ {1, .., C} represents the classes.The Euclidean dis- tance to H i of each instance is denoted as d ti , where t ∈ {1, .., N } denotes the instances.The probability of an instance label being disturbed was then established using the following equation: where j = argmax i∈{1,..,C} d ti .The maximum distance from the C channels is translated into the probability of label flipping for the t-th sample.Equation (1) ensures that the greater the distance of sample t from the hyperplane, the higher the likelihood of label flipping owing to its lower typicality.is a hyperparameter for controlling the noise ratio of the simulated noisy dataset.After identifying the sample labels flipped using Eq. ( 1), we determine the specific class to which these labels are flipped.This process involves (1)

Combating label noise TIDN-combating
Having successfully simulated a TIDN that closely mirrors real-world scenarios, we introduce the TIDN-combating framework.Let X ∈ R h×w denotes the input image; Y , Y ∈ {0, 1} C represent the one-hot latent ground truth and observed labels, respectively.Let ℓ represents the cross-entropy loss for classification, and let θ denotes the parameters of the classification net- work.Directly minimizing E X, Y ℓ f θ (X), Y leads deep networks to memorize the noisy label.To learn the correct distribution guided by ground truth Y , the ora- cle noisy transition matrix Here, we introduce the structure of the TIDN-combating framework and its corresponding recursive algorithm, illustrating the construction of T * (X) .With the modeling of T * (X) , the fitting of the observed Y leads to the fitting of the latent Y . ( An overview of the TIDN-combating framework is presented in Fig. 4.During the training stage, the feature extraction backbone ω 1 outputs embedded F features.Classification head ω 2 is expected to predict ground truth Y , and the noise modeling phase is expected to construct the mapping from the embedded features to the instance-dependent noisy transition matrix T (X) , which is an intermediate product rather than a given parameter [6].The observed Y is calculated by multiply- ing T (X) with Y .In the testing phase, the predictions are output through ω 1 and ω 2 .

Structure of the TIDN-attention
To build the learning pathway from the features to a per-sample noisy transition matrix, the TIDN-attention includes a 1 × 1 convolutional layer [26] and a fully con- nected layer, as depicted in Fig. 5.This architecture is aptly termed 'attention, ' as it extracts a set of optimizable coefficients from the features, which are then applied multiplicatively to Y .Notably, Y is also obtained through features using classification head ω 2 .
Specifically, a convolutional layer was used to downsample the features.The kernel size of the 1 × 1 convolutional layer is set according to , where |t| odd indicates the nearest odd number of t.In this study, we set γ = 2, b = 1 , in accordance with the default setting outlined in ref. [27] to capture local cross-feature interaction.The downsampled features were then activated by the ReLU function, which is mapped to a C 2 × 1 vector through fully connected layers, where the activating function is a Sigmoid function.C 2 × 1 vector was then

Recursive algorithm for noise modeling
With the proposed framework, designing a recursive method to estimate T (X) is feasible, following the spirit of expectation maximization.Instead of the EM algorithm, which cannot be directly used in deep networks, the likelihood of T and Y is alternately optimized in the proposed algorithm.In the training phase, the log-likelihood is When latent variable Y is introduced, which represents the latent distribution of ground truth, the new log-likelihood becomes where C is the total class number and ω 3 represents the TIDN-attention parameters.Based on the training data, we aim to find neural network parameter ω 1 , ω 2 , ω 3 that maximize the likelihood function.We then introduce ω k−1 representing parameters in the last turn to perform (3) 3 ) ; j refers to the row number where 1 is located in the onehot label Y t .Note that the calculation of c k−1 ti generated no gradients in the network.As T (X) is generated by ω 3 and Y is predicted by ω 2 , the second term in Eq. ( 5) could be divided into two alternative terms: The final loss function to be optimized in the neural networks can then be written as the negative of the loglikelihood function where j refers to the row number, and 1 is located on the one-hot label Y t .c k−1 ti is obtained using noisy labels and the parameters in the last turn; the first term in Eq. ( 9) is directly calculated from the f (X t ; ω k 1 , ω k 3 ), which equals the i-th column of T (X) .The last term in Eq. ( 9) is the predic- tion results of f (X t ; ω k 1 , ω k 2 ) .The first term in Eq. ( 9) also represents the expectation log-likelihood function: E y (logP( Y |y, X)) , and ω k 1 , ω k 3 are optimized through gradient decent fixing ω k 2 .The latter term in Eq. ( 9) also ( 5) denotes Kullback-Leibler divergence between the prior and posterior distribution of latent true labels, and it is optimized by fixing ω 3 .The pseudocode is presented in Algorithm 1.

Initialization of parameters
The successful convergence of the network training hinges on a careful and precise initialization of its parameters for both ω 2 and ω 3 .We initialized T (X) with T using the IIN method [7].Because T (X) in our method is an intermedi- ate product of the network and not a directly adjustable parameter, it necessitates the use of a learning approach to initialize T (X) using T .Under the IIN paradigm, ref. [7] outputs recognized noisy samples using T .We utilized the recognized noisy samples to train ω 3 while fixing ω 2 .
In the proposed framework, T (X) is obtained through the propagation paths of ω 1 and ω 3 , whereas Y is acquired via the pathways of ω 1 and ω 2 .Therefore, based on the multiplicative relationship T (X)Y = Y , the network can learn the noise transition matrix of the IIN method as an initialization by fixing ω 2 and optimizing ω 3 .
In addition, a warm-up stage is required to learn the initial distribution of Y .We set warm-up epochs to optimize ω 1 and ω 2 while freezing ω 3 , as the samples with noisy labels still benefit neutral networks in an early training stage [28].

Final prediction at test phase
Because T (X) could model the transition distribution from the ground truth to the observed noisy labels, the network shown in Fig. 4 fits both the observed noisy label Y and the latent ground truth.During the training stage, the feature extraction backbone and classification head could be fed with correct supervision; the fitting of Y leads to the simultaneous fitting of the ground truth.Thus, the noise modeling phase was removed during the test phase, and the remaining feature extraction backbone and classification head output the final classification predictions.

Dataset
We selected two datasets for colonoscopy image polyp classification: Kvasir V2 [29] and a colonoscopy video classification dataset [30].The public dataset, Kvasir V2, contains 8000 images across eight categories, with 1000 images per category.These categories included dyed resection, esophagitis, ulcerative colitis, and five other classes relevant to polyp characterization.The labels sourced from clinical institutions and experts were considered accurate.The public dataset [30] comprised 152 colonoscopy videos, including 80 adenoma, 30 serrated, and 42 hyperplastic videos, amounting to three lesion types.The video lengths varied from 6 s to 76 s, with an average of approximately 30 s.The labels were derived from the histopathology results and diagnoses by expert doctors or beginners.Histopathology results provided accurate annotations, whereas diagnoses by experts and beginners were considered noisy, with noise ratios of 35.52% and 50.00%, respectively.
We employed a video classification dataset [30] with real-world label noise to validate the proposed method for simulating label noise.In this dataset, the histopathology results were considered the ground truth.The annotations made by the experts and beginners were treated as label noise with noise ratios of 50.00% and 35.52%, respectively.The effectiveness of the noise simulation methods was validated by comparing the similarity between the simulated noise and actual real-world noise.
To verify the ability of the model to combat label noise, we trained it on both simulated and real label noise data.The datasets were divided into training, validation, and test sets at an 8:1:1 ratio.The training set labels were noisy, whereas the validation and test sets contained accurate labels.

Baselines and metrics
The IIN and IDN were compared with the proposed TIDN.The IIN contains symmetric and pair-flip-label noise [11][12][13].For symmetric label noise, the labels of randomly selected instances were uniformly flipped to other classes.For symmetric noise, labels were flipped to neighboring classes for pair-flip noise.For the simulated IDN proposed in ref. [14], the probability of flipping is related to the pixels of the images, thereby generating IDN.
The comparison methods for combating label noise include the IIN and IDN methods.For the IIN methods, co-teaching+ [16] for methods of selecting clean samples, DivideMix [5] for semi-supervised learning, and noise layer [6] for IIN layers, which are similar to our work, were selected for comparison.The part-decomposing method, part-depend [14], and confident-score-based method, CSIDN [15], were selected for the IDN method.The baseline was set as a ViT trained directly on the noisy labels.
The mean total distance is a metric [31] used to measure the difference between the distributions of a realworld and a simulated noisy dataset.Let D 1 = x i , y 1 i N i and D 2 = x i , y 2 i N i be the same dataset with two types of noisy labels.The mean total distance between datasets D 1 and D 2 is defined as where y 1 i and y 2 i are soft labels representing probability distributions over {1, ..., C}.
Test ACC was chosen as the metric for combating label noise.The annotations in the test and validation sets are the ground truth to prove the robustness of LNL.The test and validation sets were blinded during training.

Implementation details
The ViT [25] was chosen as the feature extraction backbone of our methods for the image classification task, and the video transformer network [32] was chosen as the backbone for the video classification task.During training, the resolution of all input images was adjusted to 224 × 224, and the pixel values were normalized channel-wise.The dimensions of the embedded features were B × 768 , where B is the batch size.Data augmentation was performed by random cropping and vertical flipping.
The network was based on the PyTorch (version 1.9.1) framework and trained on two 12 GB NVIDIA TITAN Xp GPUs.The ViT was optimized using the stochastic gradient descent (SGD) optimizer, whereas the TIDNattention structure was optimized using the Adam optimizer.The SGD optimizer applied an initial learning rate of 0.003 divided by 0.2 every 10 epochs.The Adam optimizer set a fixed learning rate of 0.003.The image classification task batch size was set to eight, and one for the video classification.
The training set contained noisy labels, and the validation and test sets contained the ground truth.Notably, the output epoch was chosen based on the top training ACC in the last five epochs, and the validation set was blind during training, as we were studying LNL.

Results
In this section, we describe the experiments conducted on the image classification dataset with simulated label noise and the video classification dataset with real-world label noise.Validation of the simulated TIDN subsection demonstrates that the proposed simulated noise is closer to the real-world noise.Results for combating the TIDN subsection presents the classification performance of the TIDN-attention method in countering simulated noise.Results for combating real-world label noise subsection demonstrates the classification performance of the TIDN-attention method when trained with real-world label noise.Ablation study of TIDN-combating subsection presents an ablation study of the TIDN-attention module and the initialization process.

Validation of the simulated TIDN
Different approaches for simulating label noise have been applied to colonoscopy video classification datasets containing real-world label noise.The simulated label noise was compared with real-world label noise to evaluate the simulation methods.Table 1 shows the mean total distances between the existing simulated label noise and real-world label noise from a human expert (low noise level with a noise ratio of 35.52%) and a human beginner (high noise level with a noise ratio of 50.00%).The noise ratio of the human annotators was calculated based on the ACC between the ground truth and their annotations.Our simulated TIDN had the lowest mean total distance to real-world noisy labels for both the low noise ratio (0.3440) and high label noise (0.3581) scenarios.Notably, all the simulated label noises align with the noise ratio of the real-world label noise.
The T-SNE map depicting the distribution of instances from different classes is shown in Figs. 6 and 7.In Fig. 6, the two-component T-SNE map shows the distribution of labels in the feature space.The three classes are of three different colors.Different simulated label noises with the same noise ratio (50.00%, aligned with that of a human expert) and ground truth are presented.The human expert label noise was mainly distributed on the edge of the feature map, and the proposed TIDN was the closest to it from the visualization.The simulation results for the colonoscopy classification for the eight classes are presented in Fig. 7, where there is no real-world label noise.The red circle area shows that the disturbed spaces are usually at the edge of the classification boundaries, indicating that atypical samples are more easily disturbed.

Results for combating the TIDN
Methods for combating label noise were evaluated through test ACC when training with simulated and realworld label noise.The test ACC (top 5) of the different methods used for comparison is summarized in Table 2.
Notably, the training set contained only simulated noise, whereas the labels were the ground truths in the test set.The baseline indicates that the ViT is trained directly with the simulated TIDN without any methods to combat label noise.Co-teaching+ , DivideMix, and noise layer ignore the dependence of instances.Part-dependent and CSIDN methods consider the instance dependence of label noise.The TIDN-attention achieves the greatest improvement from 87.81% to 92.44% and 67.82% to 86.23% for the 15% and 40% noise ratios, respectively.Under a 70% noise ratio, DivideMix achieved the highest test ACC of 56.41%, whereas our method achieved 52.31%, compared with the baseline of 34.82%.Figure 8 illustrates the training process of the proposed method, including the curves for ACC and loss during training.The labels in the validation set were accurate, the training set labels were noisy, and the validation set data remained unseen during training.Baseline refers to the classification network being trained directly on noisy data without using methods to counter-label noise.TIDN-attention represents the proposed classification network combating label noise.Figure 8a and c shows that when trained with noisy data, the classification network gradually overfitted noisy labels as the number of epochs increased.This was evidenced by the continuously decreasing loss of the training set, whereas the loss of the validation set initially decreased and then increased.The TIDNattention method proposed in this study enables the network to fit noisy and accurate labels simultaneously.This is shown in Fig. 8b and d, where the training and validation sets show increased ACC.

Results for combating real-world label noise
Results of the real-world label noise are presented in Fig. 9.The test set contained 15 unique videos with ground-truth labels from histopathology.The baseline denotes that the network is trained directly on noisy labels without any methods for combating the label noise.The ground truth is also the upper bound because clean labels guide the network.Our proposed method achieved the same performance of 86.67% as the upper bound when combating real-world label noise based on the opinions of human beginners.It also achieved the highest improvement, from 40.00% to 80.00%, for label noise from human experts.Only the CSIDN designed for IDN effectively improved from 40.00% to 66.66%.

Ablation study of TIDN-combating
Figure 10 presents the results of the ablation experiments using the TIDN-attention algorithm.The blue solid and red dashed lines represent the results of the proposed TIDN-attention module with and without initialization, respectively.Specifically, without initialization refers to random initialization of ω 3 and with initialization refers to the method described in Initialization of parameters subsection.The green dashed line represents the scenario in which the noise transition matrix degenerates to IIN [6], assuming T(X) = T. Figure 10a presents the results for simulated noise with noise rates ranging from 15% to 70%, whereas Fig. 10b shows the outcomes for real noise at rates from 35.52% to 50%.Under various noise settings, the proposed method consistently outperformed the ablated methods for the test set ACC.

Discussion
We introduced a TIDN to simulate real-world label noise and validated this approach by comparing the mean total distance to real-world noise against that of existing simulated noises.Subsequently, we propose the TIDNcombating framework to combat real-world label noise.The performance in combating label noise was validated using simulated and real-world noisy datasets.
In this section, we describe TIDN simulations.Figure 6 illustrates that the simulated TIDN closely resembles real-world label noise.In Fig. 7c, the area marked by the red circle indicates that the samples near the decision boundaries were prone to disturbances.As the T-SNE map represents an abstract feature space, instances on the classification boundaries were effectively identified as atypical.The mean square distances in Table 1 prove that the proposed label noise is the closest to real-world noise.Because the proposed TIDN closely mimics real-world label noise, it can validate the LNL methods without realworld label noise and ground truth data.
The noise resistance performance of the TIDN-combating was demonstrated for real and simulated noise.Table 2 shows that the proposed method combats the TIDN better than the other methodologies, and Fig. 9 proves that it also effectively combats real-world noise.Co-teaching+ , DivideMix, and noise layer ignore the dependence of instances on label noise.Co-teaching+ is ineffective because the small-loss trick does not apply to IDN.DivideMix has the best performance under 70% simulated label noise; however, it performs poorly in other settings.Noise layer is limited because its theory is  based on instance-independence assumption.For methods that consider instance dependence, part-depend does not perform as well, and the part-decomposing method does not apply to complex colonoscopy images for medical use.CSIDN has a basic improvement over the baseline; however, it is still limited as the confidence score easily causes networks to fall into overconfidence.Figure 8 indicates that our method fits both noisy labels (high training ACC) and the latent ground truth (high validation ACC).For the baseline method, the performance on the validation set first increases and then declines as the training ACC increases to the point of overfitting.In contrast, in the training process of TIDN-attention, the validation ACC increases even when the training ACC increases to above 90%.The loss curve shows convergence after a sudden rise in the warm-up and initialization epochs.The training ACC and loss were calculated using noisy labels, whereas the validation ACC and loss were calculated using the ground truth.Training and validation ACC increase simultaneously because our recursive algorithm optimizes the likelihood of T(X) and the latent ground truth.The structure fits the observed noisy labels while also fitting the ground truth distribution with the assistance of T(X).Note that the validation set contains accurate labels, it remains unseen during training in actual LNL scenarios.Despite this, the experiments demonstrate that the proposed method can learn both the distribution of label noise and true labels simultaneously.Therefore, the convergence of the training loss signifies the achievement of a neural network robust to label noise.
Figure 10 shows the results of the ablation study.Comparisons between the noise layer and TIDN-attention highlight the benefits of modeling instance-dependent T(X) rather than instance-independent T. The baseline approach with no modeling of T performed poorly.The initialization of T(X) is inevitable because it outperforms the random initialization methods.This is because initialization restricts the degrees of freedom of T(X), enhancing performance.
The limitation of our work lies in the need for better initialization to limit the degrees of freedom of T(X) or to theoretically tackle the freedom problems for an instancedependent noisy transition matrix.In addition, our method can be applied to the latest classification methods, such as those based on diffusion models [33,34], to mitigate the impact of incorrect labels.

Conclusions
We introduce a novel simulated TIDN for closely approximating real-world label noise.Because TIDN aligns well with real-world scenarios, effectively combating TIDN leads to a combination of real-world label noise.Therefore, we developed the TIDN-combating framework, which includes the TIDN-attention block and a corresponding recursive algorithm.This framework simultaneously fits the observed noisy labels and latent ground truth by modeling a noisy transition matrix, ultimately leading to accurate classification predictions.Our experiments demonstrate that the TIDN closely mimics real-world noise.Furthermore, the TIDN-combating framework achieves superior ACC on the test set annotated with ground truth, whether trained on datasets with simulated or realworld noisy labels.
In a C-class classification task, we are provided with N training pairs x n , y n N n=1 and M testing pairs x n , y n M n=1

′n
that are closely aligned with the real-world noise y n .Under the IIN paradigm, y ′ n depends solely on the original true label, y n .The prob- ability that the generated noisy label belongs to a certain class j is P(y ′ n = j|y n = i) .Under the IDN paradigm, y ′ n depends on y n and the input image x n .The corresponding probability of flipping is P y ′ n = j|y n = i, x n .The objective of combating labels is to train a deep neural network classifier using the pairs x n , y perform well on the test set x n , y n M n=1

Fig. 2
Fig. 2 Workflow of the proposed methods.The ellipses represent labels, and the red labels denote the incorrect ones

Fig. 3
Fig.3TIDN.In the TIDN paradigm, samples closer to the classification boundary are considered to have weaker typicality, making their labels prone to be mislabeled.In practical scenarios, the feature space extends beyond just two dimensions

Fig. 4 Fig. 5
Fig. 4 Overview of the TIDN-combating framework.ω 1 : The vision transformer (ViT) [25] backbone follows the original setup with residual connections.ω 2 : The fully connected head of ViT.ω 3 : Parameters of a TIDN-attention block.T (X) : Per-sample noisy transition matrix with a N × Class × Class dimension.The observed noisy label Y N×1 is from multiplying T (X) N×C×C and Y N×1 .At the testing phase, the noise modeling phase is removed; the feature extraction backbone and classified head output the final prediction expected to be ground truth

Fig. 6 T
Fig.6T-SNE map showing the distribution of labels in the feature space for classifying lesions in colonoscopy videos[30]

Fig. 7 T
Fig. 7 T-SNE map showing the distribution of simulated noisy labels in the feature space for colonoscopy image classification (8 classes).Horizontal and vertical axes represent the two components of the T-SNE plot

Fig. 8
Fig. 8 ACC and loss curves during training.The horizontal and vertical axes represent the number of training epochs and the values of ACC or loss, respectively

Fig. 9
Fig. 9 Test ACC of comparative methods under real-world label noise.Baseline refers to a classification network trained directly on noisy labels without anti-noise methods.Ground truth represents the upper bound of LNL, where accurate labels guide the classification network

Table 2
Test ACC (top 5) on Kvasir V2 dataset labeled with simulated TIDN